[k8s] Realtime GPU availability of kubernetes cluster in `sky show-gpus` #3499

romilbhardwaj · 2024-04-30T06:33:41Z

Closes #2839 and #3448. Shows realtime availability of GPUs on the cluster when --cloud kubernetes is passed to sky show-gpus.

Examples

On a kubernetes cluster with the following configuration:

2x T4:4 nodes, for a total of 8 T4 GPUs
2x V100:2 nodes, for a total of 4 V100 GPUs
2 jobs running - 1 using T4:2 and another using V100:2.

$ sky show-gpus --cloud kubernetes     
GPU   QTY_PER_NODE  TOTAL_GPUS  TOTAL_FREE_GPUS  
T4    1, 2, 3, 4    8           6               
V100  1, 2          4           2               

# With name and quantity filter
$ sky show-gpus T4:2 --cloud kubernetes
GPU  QTY_FILTER  TOTAL_GPUS  FILTERED_FREE_GPUS  
T4   2           8           6               

# With name and checking for 4x GPU. Note that `FILTERED_FREE_GPUS` is now 4, since only 4 GPUs are available as a set on a single node (the other node has only 2 of its 4 GPUs available).
$ sky show-gpus T4:4 --cloud kubernetes
GPU  QTY_FILTER  TOTAL_GPUS  FILTERED_FREE_GPUS  
T4   4           8           4               

# Without cloud filter, behavior remains unchanged
$ sky show-gpus T4:4                   
GPU  QTY  CLOUD       INSTANCE_TYPE          DEVICE_MEM  vCPUs  HOST_MEM  HOURLY_PRICE  HOURLY_SPOT_PRICE  REGION       
T4   4    AWS         g4dn.12xlarge          16GB        48     192GB     $ 3.912       $ 1.378            us-west-2    
T4   4    Azure       Standard_NC64as_T4_v3  -           64     440GB     $ 4.352       $ 0.435            eastus       
T4   4    GCP         n1-standard-64         16GB        64     240GB     $ 4.440       $ 1.168            us-central1  
T4   4    Kubernetes  (attachable)           -           -      -         $ 0.000       $ 0.000            kubernetes   

# ======= Error messages ==========

# GPU not present on the cluster
$ sky show-gpus L4 --cloud kubernetes
No GPUs matching name 'L4' found in Kubernetes cluster. If your cluster contains GPUs, make sure nvidia.com/gpu resource is available on the nodes and the node labels for identifying GPUs (e.g., skypilot.co/accelerator) are setup correctly. To list all available accelerators, run: sky show-gpus --cloud kubernetes.

# Checking for more quantity than is available
$ sky show-gpus T4:8 --cloud kubernetes
No GPUs matching name 'T4' with quantity 8 found in Kubernetes cluster. If your cluster contains GPUs, make sure nvidia.com/gpu resource is available on the nodes and the node labels for identifying GPUs (e.g., skypilot.co/accelerator) are setup correctly. To list all available accelerators, run: sky show-gpus --cloud kubernetes.

# On a cluster with no GPUs (e.g., `sky local up`)
$ sky show-gpus --cloud kubernetes 
No GPUs found in Kubernetes cluster. If your cluster contains GPUs, make sure nvidia.com/gpu resource is available on the nodes and the node labels for identifying GPUs (e.g., skypilot.co/accelerator) are setup correctly. To further debug, run: sky check.
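The quantity filter in the examples above can be read as "count only free GPUs on nodes that can serve the request on their own". The following is a minimal sketch of that reading, not the code in this PR: `NodeGpus` and `filtered_availability` are hypothetical names, and the rule that a request must fit on a single node is an assumption inferred from the T4:4 and T4:8 outputs.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class NodeGpus:
    """Hypothetical per-node record; not a SkyPilot type."""
    gpu: str       # accelerator name, e.g. 'T4'
    capacity: int  # GPUs installed on the node
    free: int      # GPUs not claimed by a running job


def filtered_availability(
        nodes: List[NodeGpus],
        name: str,
        quantity: Optional[int] = None) -> Optional[Tuple[int, int]]:
    """Returns (total GPUs, free GPUs) for `name`, or None if nothing matches.

    Assumption: a request for `quantity` GPUs must fit on one node, so only
    nodes with at least `quantity` free GPUs contribute to the free count.
    """
    matching = [n for n in nodes if n.gpu.lower() == name.lower()]
    min_capacity = 1 if quantity is None else quantity
    # No node of this type is large enough (or the type is absent).
    if not any(n.capacity >= min_capacity for n in matching):
        return None
    min_free = 0 if quantity is None else quantity
    total = sum(n.capacity for n in matching)
    free = sum(n.free for n in matching if n.free >= min_free)
    return total, free


# The example cluster from the description: 2x T4:4 nodes, 2x V100:2 nodes,
# one job holding T4:2 and another holding V100:2.
cluster = [
    NodeGpus('T4', capacity=4, free=4),
    NodeGpus('T4', capacity=4, free=2),
    NodeGpus('V100', capacity=2, free=2),
    NodeGpus('V100', capacity=2, free=0),
]

print(filtered_availability(cluster, 'T4'))     # (8, 6)
print(filtered_availability(cluster, 'T4', 2))  # (8, 6)
print(filtered_availability(cluster, 'T4', 4))  # (8, 4)
print(filtered_availability(cluster, 'T4', 8))  # None
```

Under this reading, T4:8 returns nothing because no single node holds 8 GPUs, which matches the "No GPUs matching name 'T4' with quantity 8" message shown above.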

Tested (run the relevant ones):

Code formatting: bash format.sh
Rendered docs
Manual tests with the examples above and on a no GPU kubernetes cluster
pytest tests/test_list_accelerators.py

romilbhardwaj · 2024-04-30T19:21:17Z

Ran some tests + updated cli docs. Ready for review.

Michaelvll

Thanks for adding the support for this @romilbhardwaj! Mostly looks good to me, with minor nits. Just tried it on a k8s cluster with GPUs and it seems to work well.

sky/cli.py

Michaelvll · 2024-05-07T22:08:41Z

sky/clouds/service_catalog/kubernetes_catalog.py

+    return list_accelerators_realtime(gpus_only, name_filter, region_filter,
+                                      quantity_filter, case_sensitive,
+                                      all_regions, require_price)[0]


Will calling this add overhead to list_accelerators? We rely on list_accelerators to generate the candidate resources for optimization, and it is called multiple times during the failover process. It would be nice to make sure this does not add overhead. : )

That's a good point. The overhead is not much different from the previous implementation, since the previous implementation was also invoking the Kubernetes API:

This branch:
multitime -n 5 sky launch --dryrun -y --gpus T4:1
===> multitime results
1: sky launch --dryrun -y --gpus T4:1
            Mean        Std.Dev.    Min         Median      Max
real        3.883       0.064       3.782       3.883       3.982
user        2.775       0.081       2.654       2.766       2.871
sys         3.136       0.285       2.676       3.268       3.448

Master:
multitime -n 5 sky launch --dryrun -y --gpus T4:1
1: sky launch --dryrun -y --gpus T4:1
            Mean        Std.Dev.    Min         Median      Max
real        3.863       0.032       3.829       3.860       3.917
user        2.713       0.023       2.670       2.716       2.735
sys         3.438       0.097       3.267       3.471       3.535

That said, we should add an LRU cache with a time-to-live (TTL) so that cached results expire after a fixed time. Added a TODO.
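For reference, a cache of the kind the TODO describes could look roughly like the sketch below. It is an illustration only, not code from this PR: `lru_ttl_cache`, its parameters, and `fetch_gpu_counts` are hypothetical names, and the decorator only supports hashable positional arguments.

```python
import collections
import functools
import time
from typing import Any, Callable, Hashable, Tuple


def lru_ttl_cache(maxsize: int = 32,
                  ttl_seconds: float = 10.0,
                  clock: Callable[[], float] = time.monotonic):
    """Caches results per argument tuple, evicting by age and by recency.

    `clock` is injectable so that expiry can be tested without sleeping.
    """

    def decorator(fn: Callable[..., Any]) -> Callable[..., Any]:
        # Maps args -> (time stored, value); order tracks recency of use.
        entries: 'collections.OrderedDict[Hashable, Tuple[float, Any]]' = (
            collections.OrderedDict())

        @functools.wraps(fn)
        def wrapper(*args: Hashable) -> Any:
            now = clock()
            if args in entries:
                stored_at, value = entries[args]
                if now - stored_at < ttl_seconds:
                    entries.move_to_end(args)  # mark as most recently used
                    return value
                del entries[args]  # entry is older than the TTL
            value = fn(*args)
            entries[args] = (now, value)
            if len(entries) > maxsize:
                entries.popitem(last=False)  # drop least recently used
            return value

        return wrapper

    return decorator


@lru_ttl_cache(maxsize=8, ttl_seconds=5.0)
def fetch_gpu_counts(context: str) -> dict:
    # Placeholder for a slow cluster query; returns static data here.
    return {'T4': 8, 'V100': 4}


first = fetch_gpu_counts('my-cluster')
second = fetch_gpu_counts('my-cluster')  # served from the cache
print(first is second)  # True
```

A short TTL keeps repeated calls during failover cheap while bounding how stale the reported availability can get; the right value depends on how quickly GPUs are claimed on the cluster.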

sky/cli.py

romilbhardwaj · 2024-05-18T00:21:14Z

Thanks @Michaelvll! Ready for another look.

Michaelvll

Thanks for the update @romilbhardwaj! LGTM. IIRC, we may want to have a separate section for the k8s table in sky show-gpus without any argument, so that it is easier to distinguish those "on-prem" GPUs.

Also, it seems sky show-gpus t4 does not contain the kubernetes cluster, although sky show-gpus --cloud kubernetes does show the T4 GPUs. Can we show the k8s section in sky show-gpus t4 as well?

romilbhardwaj · 2024-05-24T02:06:03Z

Thanks @Michaelvll - I've made some updates:

Thanks for catching the case sensitivity bug! It's fixed now - sky show-gpus t4 or sky show-gpus T4 will show:

(base) ➜  ~ sky show-gpus t4
GPU  QTY  CLOUD       INSTANCE_TYPE          DEVICE_MEM  vCPUs  HOST_MEM  HOURLY_PRICE  HOURLY_SPOT_PRICE  REGION
T4   1    Kubernetes  (attachable)           -           -      -         $ 0.000       $ 0.000            kubernetes
T4   2    Kubernetes  (attachable)           -           -      -         $ 0.000       $ 0.000            kubernetes
T4   3    Kubernetes  (attachable)           -           -      -         $ 0.000       $ 0.000            kubernetes
T4   4    Kubernetes  (attachable)           -           -      -         $ 0.000       $ 0.000            kubernetes
T4   1    Azure       Standard_NC4as_T4_v3   -           4      28GB      $ 0.526       $ 0.053            eastus
...

I've updated sky show-gpus to show Kubernetes GPUs in a separate table (in the examples below, P500 is a dummy GPU I created on one of the nodes to simulate any non-canonical GPUs that users may have on their cluster):

===== When Kubernetes is enabled and has GPUs =====
$ sky show-gpus
COMMON_GPU  AVAILABLE_QUANTITIES
A10         1, 2, 4
A10G        1, 4, 8
A100        1, 2, 4, 8, 16
A100-80GB   1, 2, 4, 8
H100        1, 2, 4, 6, 8, 12
K80         1, 2, 4, 8, 16
L4          1, 2, 4, 8
M60         1, 2, 4
P100        1, 2, 4
T4          1, 2, 3, 4, 8
V100        1, 2, 4, 8
V100-32GB   1, 2, 4, 8

KUBERNETES_GPU  QTY_PER_NODE  TOTAL_GPUS  TOTAL_FREE_GPUS
P500            1, 2, 3, 4    4           4
T4              1, 2, 3, 4    8           8
V100            1, 2          4           4

GOOGLE_TPU   AVAILABLE_QUANTITIES
tpu-v2-8     1
tpu-v2-32    1
tpu-v2-128   1
tpu-v2-256   1
tpu-v2-512   1
tpu-v3-8     1
tpu-v3-32    1
tpu-v3-64    1
tpu-v3-128   1
tpu-v3-256   1
tpu-v3-512   1
tpu-v3-1024  1
tpu-v3-2048  1

$ sky show-gpus --cloud kubernetes
GPU   QTY_PER_NODE  TOTAL_GPUS  TOTAL_FREE_GPUS
P500  1, 2, 3, 4    4           4
T4    1, 2, 3, 4    8           8
V100  1, 2          4           4

===== When Kubernetes is enabled but does not have GPUs =====
$ sky show-gpus
COMMON_GPU  AVAILABLE_QUANTITIES
A10         1, 2, 4
A10G        1, 4, 8
A100        1, 2, 4, 8, 16
A100-80GB   1, 2, 4, 8
H100        1, 2, 4, 6, 8, 12
K80         1, 2, 4, 8, 16
L4          1, 2, 4, 8
M60         1, 2, 4
P100        1, 2, 4
T4          1, 2, 4, 8
V100        1, 2, 4, 8
V100-32GB   1, 2, 4, 8

No GPUs found in Kubernetes cluster. If your cluster contains GPUs, make sure nvidia.com/gpu resource is available on the nodes and the node labels for identifying GPUs (e.g., skypilot.co/accelerator) are setup correctly. To further debug, run: sky check.

GOOGLE_TPU   AVAILABLE_QUANTITIES
tpu-v2-8     1
tpu-v2-32    1
tpu-v2-128   1
tpu-v2-256   1
tpu-v2-512   1
tpu-v3-8     1
tpu-v3-32    1
tpu-v3-64    1
tpu-v3-128   1
tpu-v3-256   1
tpu-v3-512   1
tpu-v3-1024  1
tpu-v3-2048  1

Hint: use -a/--all to see all accelerators (including non-common ones) and pricing.

$ sky show-gpus --cloud kubernetes
No GPUs found in Kubernetes cluster. If your cluster contains GPUs, make sure nvidia.com/gpu resource is available on the nodes and the node labels for identifying GPUs (e.g., skypilot.co/accelerator) are setup correctly. To further debug, run: sky check.

===== When Kubernetes is not enabled =====
$ sky show-gpus
COMMON_GPU  AVAILABLE_QUANTITIES
A10         1, 2, 4
A10G        1, 4, 8
A100        1, 2, 4, 8, 16
A100-80GB   1, 2, 4, 8
H100        1, 2, 4, 6, 8, 12
K80         1, 2, 4, 8, 16
L4          1, 2, 4, 8
M60         1, 2, 4
P100        1, 2, 4
T4          1, 2, 4, 8
V100        1, 2, 4, 8
V100-32GB   1, 2, 4, 8

GOOGLE_TPU   AVAILABLE_QUANTITIES
tpu-v2-8     1
tpu-v2-32    1
tpu-v2-128   1
tpu-v2-256   1
tpu-v2-512   1
tpu-v3-8     1
tpu-v3-32    1
tpu-v3-64    1
tpu-v3-128   1
tpu-v3-256   1
tpu-v3-512   1
tpu-v3-1024  1
tpu-v3-2048  1

Hint: use -a/--all to see all accelerators (including non-common ones) and pricing.

Michaelvll

Thanks for the update @romilbhardwaj!

I found having kubernetes GPUs mixed with the cloud tables a bit weird in sky show-gpus t4.

One idea: we just have two sections, one for clouds, and one for k8s? For the k8s section, we just show the real-time availability table.

Similarly for sky show-gpus, we can have two sections, each with a title, e.g., Clouds, Kubernetes (similar to our sky status with three sections for clusters, jobs, and services).

We can have the Kubernetes section at the top so as to make all the cloud tables more connected together : )

sky/cli.py

romilbhardwaj · 2024-05-24T23:18:31Z

Thanks @Michaelvll - here's the latest behavior to help review:

===== When Kubernetes is enabled and has GPUs =====
(base) ➜  ~ sky show-gpus
Kubernetes GPUs
GPU   QTY_PER_NODE  TOTAL_GPUS  TOTAL_FREE_GPUS
P500  1, 2, 3, 4    4           4
T4    1, 2, 3, 4    8           8
V100  1, 2          4           4

Cloud GPUs
COMMON_GPU  AVAILABLE_QUANTITIES
A10         1, 2, 4
A10G        1, 4, 8
A100        1, 2, 4, 8, 16
A100-80GB   1, 2, 4, 8
H100        1, 2, 4, 6, 8, 12
K80         1, 2, 4, 8, 16
L4          1, 2, 4, 8
M60         1, 2, 4
P100        1, 2, 4
T4          1, 2, 4, 8
V100        1, 2, 4, 8
V100-32GB   1, 2, 4, 8

GOOGLE_TPU   AVAILABLE_QUANTITIES
tpu-v2-8     1
tpu-v2-32    1
tpu-v2-128   1
tpu-v2-256   1
tpu-v2-512   1
tpu-v3-8     1
tpu-v3-32    1
tpu-v3-64    1
tpu-v3-128   1
tpu-v3-256   1
tpu-v3-512   1
tpu-v3-1024  1
tpu-v3-2048  1

Hint: use -a/--all to see all accelerators (including non-common ones) and pricing.

$ sky show-gpus --cloud kubernetes
Kubernetes GPUs
GPU   QTY_PER_NODE  TOTAL_GPUS  TOTAL_FREE_GPUS
P500  1, 2, 3, 4    4           4
T4    1, 2, 3, 4    8           8
V100  1, 2          4           4

# GPU that only exists in kubernetes
$ sky show-gpus P500
Kubernetes GPUs
GPU   QTY_PER_NODE  TOTAL_GPUS  TOTAL_FREE_GPUS
P500  1, 2, 3, 4    4           4

Cloud GPUs
Resources 'P500' not found in cloud catalogs. To show available accelerators, run: sky show-gpus --all

# GPU that doesn't exist in Kubernetes
$ sky show-gpus H100
Kubernetes GPUs
Resources 'H100' not found in Kubernetes cluster. If your cluster contains GPUs, make sure nvidia.com/gpu resource is available on the nodes and the node labels for identifying GPUs (e.g., skypilot.co/accelerator) are setup correctly. To show available accelerators on kubernetes, run: sky show-gpus --cloud kubernetes

Cloud GPUs
...

# Invalid GPU name
$ sky show-gpus K9000
Kubernetes GPUs
Resources 'K9000' not found in Kubernetes cluster. If your cluster contains GPUs, make sure nvidia.com/gpu resource is available on the nodes and the node labels for identifying GPUs (e.g., skypilot.co/accelerator) are setup correctly. To show available accelerators on kubernetes, run: sky show-gpus --cloud kubernetes

Cloud GPUs
Resources 'K9000' not found in cloud catalogs. To show available accelerators, run: sky show-gpus --all

===== When Kubernetes is enabled but does not have GPUs =====
(base) ➜  ~ sky show-gpus
COMMON_GPU  AVAILABLE_QUANTITIES
A10         1, 2, 4
A10G        1, 4, 8
A100        1, 2, 4, 8, 16
A100-80GB   1, 2, 4, 8
H100        1, 2, 4, 6, 8, 12
K80         1, 2, 4, 8, 16
L4          1, 2, 4, 8
M60         1, 2, 4
P100        1, 2, 4
T4          1, 2, 4, 8
V100        1, 2, 4, 8
V100-32GB   1, 2, 4, 8

GOOGLE_TPU   AVAILABLE_QUANTITIES
tpu-v2-8     1
tpu-v2-32    1
tpu-v2-128   1
tpu-v2-256   1
tpu-v2-512   1
tpu-v3-8     1
tpu-v3-32    1
tpu-v3-64    1
tpu-v3-128   1
tpu-v3-256   1
tpu-v3-512   1
tpu-v3-1024  1
tpu-v3-2048  1

Hint: use -a/--all to see all accelerators (including non-common ones) and pricing.

Note: No GPUs found in Kubernetes cluster. If your cluster contains GPUs, make sure nvidia.com/gpu resource is available on the nodes and the node labels for identifying GPUs (e.g., skypilot.co/accelerator) are setup correctly. To further debug, run: sky check.

$ sky show-gpus --cloud kubernetes
No GPUs found in Kubernetes cluster. If your cluster contains GPUs, make sure nvidia.com/gpu resource is available on the nodes and the node labels for identifying GPUs (e.g., skypilot.co/accelerator) are setup correctly. To further debug, run: sky check.


$ sky show-gpus --all
<Note is shown after the quantities before the start of the longer tables since that output can be quite long >
COMMON_GPU  AVAILABLE_QUANTITIES
...

GOOGLE_TPU   AVAILABLE_QUANTITIES
...

OTHER_GPU        AVAILABLE_QUANTITIES
...

Note: No GPUs found in Kubernetes cluster. If your cluster contains GPUs, make sure nvidia.com/gpu resource is available on the nodes and the node labels for identifying GPUs (e.g., skypilot.co/accelerator) are setup correctly. To further debug, run: sky check.

GPU   QTY  CLOUD       INSTANCE_TYPE        DEVICE_MEM  vCPUs  HOST_MEM  HOURLY_PRICE  HOURLY_SPOT_PRICE
...

===== When Kubernetes is not enabled =====
$ sky show-gpus
COMMON_GPU  AVAILABLE_QUANTITIES
A10         1, 2, 4
A10G        1, 4, 8
A100        1, 2, 4, 8, 16
A100-80GB   1, 2, 4, 8
H100        1, 2, 4, 6, 8, 12
K80         1, 2, 4, 8, 16
L4          1, 2, 4, 8
M60         1, 2, 4
P100        1, 2, 4
T4          1, 2, 4, 8
V100        1, 2, 4, 8
V100-32GB   1, 2, 4, 8

GOOGLE_TPU   AVAILABLE_QUANTITIES
tpu-v2-8     1
tpu-v2-32    1
tpu-v2-128   1
tpu-v2-256   1
tpu-v2-512   1
tpu-v3-8     1
tpu-v3-32    1
tpu-v3-64    1
tpu-v3-128   1
tpu-v3-256   1
tpu-v3-512   1
tpu-v3-1024  1
tpu-v3-2048  1

Hint: use -a/--all to see all accelerators (including non-common ones) and pricing.

(base) ➜  ~ sky show-gpus --cloud kubernetes
Kubernetes is not enabled. To fix, run: sky check kubernetes
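The three cases above for a bare sky show-gpus can be summarized in a small rendering sketch. This is an illustration of the behavior described in this comment, not the code in sky/cli.py: `format_table`, `render_show_gpus`, and the constants are hypothetical names, and the note text is shortened.

```python
from typing import List, Sequence, Tuple

# (GPU, QTY_PER_NODE, TOTAL_GPUS, TOTAL_FREE_GPUS), as in the tables above.
K8sRow = Tuple[str, str, int, int]
K8S_HEADER = ('GPU', 'QTY_PER_NODE', 'TOTAL_GPUS', 'TOTAL_FREE_GPUS')
NO_K8S_GPUS_NOTE = 'Note: No GPUs found in Kubernetes cluster.'  # shortened


def format_table(header: Sequence[str],
                 rows: Sequence[Sequence[object]]) -> List[str]:
    """Left-aligns columns and separates them with two spaces."""
    table = [tuple(header)] + [tuple(str(c) for c in row) for row in rows]
    widths = [max(len(row[i]) for row in table) for i in range(len(header))]
    return [
        '  '.join(c.ljust(w) for c, w in zip(row, widths)).rstrip()
        for row in table
    ]


def render_show_gpus(k8s_enabled: bool, k8s_rows: List[K8sRow],
                     cloud_lines: List[str]) -> List[str]:
    """Returns the output lines for `sky show-gpus` with no arguments."""
    lines: List[str] = []
    if k8s_enabled and k8s_rows:
        # Kubernetes section first, then a titled cloud section.
        lines.append('Kubernetes GPUs')
        lines.extend(format_table(K8S_HEADER, k8s_rows))
        lines.extend(['', 'Cloud GPUs'])
    lines.extend(cloud_lines)
    if k8s_enabled and not k8s_rows:
        # Enabled but empty: cloud tables stay untitled, note goes last.
        lines.extend(['', NO_K8S_GPUS_NOTE])
    return lines


cloud = ['COMMON_GPU  AVAILABLE_QUANTITIES', 'T4          1, 2, 4, 8']
rows = [('T4', '1, 2, 3, 4', 8, 8), ('V100', '1, 2', 4, 4)]
print('\n'.join(render_show_gpus(True, rows, cloud)))
```

When Kubernetes is not enabled, the function returns the cloud lines unchanged, which mirrors the last case above.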

Michaelvll

Thanks @romilbhardwaj for updating this! It works great! LGTM!

sky/cli.py

Michaelvll · 2024-05-27T04:53:27Z

A minor point: for sky show-gpus -a, it would be nice to show the hint at the top instead of in the middle, since the latter is hard to find, especially since the output is piped through | less.

romilbhardwaj · 2024-05-27T20:04:52Z

Thanks @Michaelvll! Moved the hint to the top for -a and simplified the logic a bit in 997bec1.

romilbhardwaj added 3 commits April 29, 2024 23:27

wip

e6b975d

filtering support

a6b5bfc

lint

1346159

romilbhardwaj marked this pull request as ready for review April 30, 2024 19:15

update doc

6bbbf25

Michaelvll self-requested a review May 1, 2024 05:02

rename headers

a263365

Michaelvll reviewed May 8, 2024

Michaelvll force-pushed the k8s_show_gpus_availability branch from dd02ab9 to a263365 May 8, 2024 03:31

romilbhardwaj added 4 commits May 17, 2024 13:39

Merge branch 'master' of https://github.com/skypilot-org/skypilot int…

6dfb785

…o k8s_show_gpus_availability # Conflicts: # sky/cli.py

comments

0bd06a4

add TODO

6bf3045

Add autoscaler note

f960322

Michaelvll reviewed May 18, 2024

Merge branch 'master' of https://github.com/skypilot-org/skypilot int…

8e1821d

…o k8s_show_gpus_availability

romilbhardwaj added this to the v0.6 milestone May 23, 2024

romilbhardwaj added 5 commits May 23, 2024 17:41

case sensitive fix

8878254

case sensitive fix

3fe8fc6

show kubernetes GPUs in a separate table in sky show-gpus

2203d6b

lint

b75e471

lint

ba98957

romilbhardwaj added 3 commits May 23, 2024 19:11

fix for non-k8s cloud specified

b44b759

fix for region specified with k8s

57cc132

lint

4665386

Michaelvll reviewed May 24, 2024

romilbhardwaj added 2 commits May 24, 2024 12:23

show kubernetes in separate section

400336f

wip

3d3e121

romilbhardwaj added 3 commits May 24, 2024 15:49

move messages to the end

e13ba3d

lint

9e308e0

lint

8a36851

show sections if name is specified

db95895

Michaelvll approved these changes May 27, 2024

romilbhardwaj added 4 commits May 27, 2024 10:30

comments

91a4356

lint

8e48e68

fix bugs and move warning for show_all to the top

997bec1

lint

72f08d9

romilbhardwaj merged commit e006a79 into master May 27, 2024
20 checks passed

romilbhardwaj deleted the k8s_show_gpus_availability branch May 27, 2024 21:23

Michaelvll mentioned this pull request May 29, 2024

[k8s] Kubernetes realtime availability #3006

Closed

5 tasks
